Replace Python ANTLR parser with C++ parser and AST builder #589

Closed
javihern98 wants to merge 22 commits into main from perf/cpp-parser-ast

@javihern98
Contributor

Summary

Replace the pure-Python ANTLR parser and AST visitor with a C++ implementation exposed via pybind11, cutting parse + AST construction time on the MG01 benchmark from ~8.4s to ~1.7s (about 5x).

Key changes:

  • C++ ANTLR parser: Lexer/parser generated from the same .g4 grammar, compiled as a native pybind11 extension (vtl_cpp_parser.so)
  • C++ AST builder: All 197 visitor methods (Terminals, ExprComponents, Expr, ASTConstructor) ported to C++. The builder walks the ANTLR parse tree natively and constructs Python AST dataclass instances via pybind11, so the tree walk itself no longer crosses the Python/C++ boundary
  • CI updates: Workflows build the C++ extension, cache wheels, and handle MSVC/Linux builds
  • Shutdown safety: Static py::object refs cleaned up via atexit to prevent segfault at interpreter finalization

Performance (MG01 benchmark, 52K-line VTL script):

| Phase                                 | Time  | Speedup |
|---------------------------------------|-------|---------|
| Python parser + Python AST (baseline) | ~8.4s | -       |
| C++ parser + Python AST               | ~3.0s | 2.8x    |
| C++ parser + C++ AST (this PR)        | ~1.7s | 5.0x    |
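As a quick check of the speedup column, here is a minimal sketch that recomputes the ratios from the reported timings. The dictionary keys and the `speedup` helper are illustrative names, not part of the project. Because the timings are rounded, the last ratio comes out slightly under the reported 5.0x.

```python
# Timings in seconds as reported in the benchmark table; key names are illustrative.
timings = {
    "python_parser_python_ast": 8.4,  # baseline
    "cpp_parser_python_ast": 3.0,
    "cpp_parser_cpp_ast": 1.7,
}


def speedup(baseline: float, measured: float) -> float:
    """Return how many times faster `measured` is than `baseline`."""
    return baseline / measured


baseline = timings["python_parser_python_ast"]
for name, seconds in timings.items():
    print(f"{name}: {speedup(baseline, seconds):.1f}x")
```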

Checklist

  • Code quality checks pass (ruff format, ruff check, mypy)
  • Tests pass (pytest) — all 4037 tests pass
  • Documentation updated (if applicable)

Impact / Risk

  • Breaking changes? No — the public API (run(), semantic_analysis(), prettify()) is unchanged. The C++ builder produces identical Python AST objects.
  • Data/SDMX compatibility concerns? None — downstream Interpreter/DAG see the same Python dataclass instances.
  • Notes for release/changelog? Build now requires a C++ compiler and CMake. Binary wheels are built via cibuildwheel in CI.

Notes

  • The old Python parser (lexer.py, parser.py, VtlVisitor.py) is removed since the C++ extension fully replaces it.
  • antlr4-python3-runtime dependency is no longer needed at runtime (ANTLR C++ runtime is statically linked).
  • The perf/cpp-parser base branch contains the parser-only work; this branch adds the AST builder on top.

Replace the antlr4-python3-runtime dependency with a C++ ANTLR parser
exposed through pybind11, achieving 95.7% performance improvement
(9.6s → 0.41s for 8000 statements).

Key changes:
- Add C++ ANTLR parser with pybind11 lazy-wrapping bindings
- Refactor ASTConstructor to use (rule_index, alt_index) dispatch
- Switch build backend from poetry-core to scikit-build-core
- Update CI workflows for C++ compilation and cibuildwheel
- Remove antlr4-python3-runtime dependency
- Delete dead files: lexer.py, parser.py, VtlVisitor.py
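The (rule_index, alt_index) dispatch mentioned in the list above could look roughly like the sketch below. The index constants, node shape, and handler names are hypothetical; the real values come from the generated parser and are not shown in this PR.

```python
from typing import Callable, Dict, Tuple

# Hypothetical rule/alternative indices; the real ones come from the
# generated ANTLR parser.
RULE_EXPR = 2
ALT_PARENTHESIS = 0
ALT_ARITHMETIC = 1

Handler = Callable[[dict], str]


def visit_parenthesis_expr(node: dict) -> str:
    return f"Paren({node['text']})"


def visit_arithmetic_expr(node: dict) -> str:
    return f"BinOp({node['text']})"


# A single dictionary lookup on (rule_index, alt_index) selects the handler.
DISPATCH: Dict[Tuple[int, int], Handler] = {
    (RULE_EXPR, ALT_PARENTHESIS): visit_parenthesis_expr,
    (RULE_EXPR, ALT_ARITHMETIC): visit_arithmetic_expr,
}


def visit(node: dict) -> str:
    key = (node["rule_index"], node["alt_index"])
    try:
        handler = DISPATCH[key]
    except KeyError:
        raise NotImplementedError(f"no visitor for {key}") from None
    return handler(node)


result = visit({"rule_index": RULE_EXPR, "alt_index": ALT_ARITHMETIC, "text": "a + b"})
```

Keying on a pair of integers means the dispatcher only needs two cheap attributes from each wrapped parse-tree node, which fits the lazy-wrapping bindings described above.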
Fix ruff I001 import sorting errors across 7 files after moving
_cpp_parser into Grammar/. Update cibuildwheel config: test-requires
as array, per-platform before-build commands.

poetry install doesn't invoke scikit-build-core to compile the C++
extension. Use pip install . (with build isolation) after installing
deps to actually build the .so module. Update version.yml to not
require poetry for version extraction.

Testing workflow: copy the compiled C++ extension from site-packages
back to the source tree so mypy can resolve the import.

Ubuntu 24.04: use --no-deps to avoid upgrading system numpy/pandas
which causes binary incompatibility errors.

The YAML folded scalar (>) was preserving leading whitespace in the
python -c command, causing IndentationError. Use single-line command.
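The failure mode described in the commit message above can be reproduced without YAML: source text with leading whitespace on its first statement does not compile as a top-level script. This is a minimal sketch; the command strings are made up and are not the actual workflow command.

```python
# Source as it might arrive when leading spaces are preserved
# (hypothetical content, not the real workflow command).
indented_source = "    import sys\n    print(sys.version_info[0])\n"
single_line_source = "import sys; print(sys.version_info[0])"


def compiles(source: str) -> bool:
    """Return True if `source` compiles as a top-level script, as `python -c` requires."""
    try:
        compile(source, "<string>", "exec")
        return True
    except IndentationError:
        return False


indented_ok = compiles(indented_source)
single_line_ok = compiles(single_line_source)
```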
Use follow_imports = "silent" for vtlengine.AST.* in mypy config
to suppress errors from AST files when the C++ extension .so isn't
in the source tree (CI builds install to site-packages only).

Remove the copy C++ extension step from the testing workflow.
Instead of silencing all AST modules, target only:
- _cpp_parser: follow_imports=silent (handles missing .so in CI)
- ASTConstructor + ASTConstructorModules: disallow_untyped_calls=false

No need to build the C++ parser just to check version consistency.
Extract __version__ with grep and pyproject version with tomllib,
removing the ANTLR download and pip install steps entirely.
Pure bash version check using grep — no Python, no build needed.

Use actions/cache to store the built wheel keyed on OS, Python
version, and hash of C++ source files. On cache hit, ANTLR download
and C++ compilation are skipped entirely — only the wheel install
runs.
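The idea behind the cache key (OS, Python version, and a hash of the C++ sources) can be sketched in Python as follows. This mirrors the concept only; it is not the workflow's actual key expression, and the `wheel-` prefix and function names are hypothetical.

```python
import hashlib
import platform
import sys
import tempfile
from pathlib import Path
from typing import Iterable


def sources_digest(paths: Iterable[Path]) -> str:
    """Hash file names and contents in a stable order."""
    digest = hashlib.sha256()
    for path in sorted(paths):
        digest.update(path.name.encode())
        digest.update(path.read_bytes())
    return digest.hexdigest()


def wheel_cache_key(paths: Iterable[Path]) -> str:
    python_version = f"{sys.version_info.major}.{sys.version_info.minor}"
    return f"wheel-{platform.system()}-py{python_version}-{sources_digest(paths)}"


# Demo: the key is stable for unchanged sources and changes when a source changes.
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "parser.cpp"
    source.write_text("int answer = 1;\n")
    key_before = wheel_cache_key([source])
    key_repeat = wheel_cache_key([source])
    source.write_text("int answer = 2;\n")
    key_after = wheel_cache_key([source])
```

Any edit to a hashed source file produces a new key, so a stale wheel is never reused after the C++ code changes.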
The missing .so cascades errors through all AST files, not just
the constructor. Use follow_imports=silent for vtlengine.AST.*
which matches the existing exclude pattern's intent.

- Define ANTLR4CPP_STATIC to avoid dllimport errors on Windows
- Use /w instead of -w for MSVC warning suppression
- Broaden mypy follow_imports=silent to all vtlengine.AST.*

Not needed for parsing and causes MSVC build errors with
high_resolution_clock on Windows.

ProfilingATNSimulator is referenced by other ANTLR runtime code, so
it can't be excluded. Use /FI"chrono" on MSVC to fix the missing
high_resolution_clock symbol.

Use setup-python's built-in poetry cache for faster dependency
installs. Combine dependency install and wheel install into one step.

Phase 0: Infrastructure - cached py::object refs for all 44 AST classes,
ScalarType classes, Model classes, SemanticError. Helper functions for
token info extraction, node type checking, and Python class construction.

Phase 1: Port all 50 Terminal visitor methods from Terminals.py to C++,
including visitConstant, visitVarID, visitComponentID, visitBasicScalarType,
visitWindowingClause, and all other leaf-level visitors.

Phase 2: Port all 48 ExprComponent visitor methods from ExprComponents.py
to C++, including visitExprComponent (recursive dispatch), all function
component visitors (string, numeric, time, comparison, conditional,
aggregate, analytic), and the cast/eval operators.

All 4037 tests pass. The C++ functions are exposed as pybind11 bindings
but not yet wired into the main AST construction path - the Python
visitor still runs unchanged.

Implement all 81 dataset-level expression visitors in C++:
- visitExpr dispatch and all expression alternatives
- Join functions (inner/left/cross/full join with clause handling)
- Dataset clauses (rename, calc, filter, keep/drop, pivot/unpivot, subspace, aggr)
- String, numeric, comparison, time, conditional functions
- Set functions (union, intersect, diff, symdiff, setdiff)
- Hierarchy and validation functions
- Aggregate and analytic functions with grouping clauses
- de_ruleset_elements dict mutation for validation operators

Implement all 18 top-level visitor methods in C++:
- visitStart entry point with statement collection
- visitStatement dispatch (temporary/persist assignment, define expression)
- Operator definition with parameter items and return types
- Datapoint ruleset definition with signature and rule clauses
- Hierarchical ruleset definition with code item relations
- build_ast() now calls visitStart() for full tree walk

- Initialize missing cached AST class refs (Argument, Operator, HRBinOp, etc.)
- Replace ASTVisitor().visitStart() with C++ build_ast() in create_ast()
- Fix default Windowing to use raw ints (-1, 0) matching Python behavior
  (create_windowing normalizes to strings, but ASTString checks for ints)
- Fix visitSignedInteger overflow: use stoll instead of stoi for large numbers

- Add ASTBuilder::cleanup() to release all cached Python class references
- Add cleanup_phase3() for Phase 3 statics
- Use stoll instead of stoi for large integer support in visitSignedInteger
- Fix default Windowing to preserve raw ints matching Python behavior
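The stoi to stoll change matters because std::stoi returns an int and throws std::out_of_range for values that do not fit, while std::stoll returns a long long. The range difference can be illustrated in Python, assuming the common case of a 32-bit int and a 64-bit long long; the literal below is an arbitrary example, not one from the test suite.

```python
# Assumed widths: 32-bit int (std::stoi) and 64-bit long long (std::stoll).
INT32_MIN, INT32_MAX = -(2**31), 2**31 - 1
INT64_MIN, INT64_MAX = -(2**63), 2**63 - 1


def fits(value_text: str, lowest: int, highest: int) -> bool:
    """Return True if the decimal literal lies within [lowest, highest]."""
    return lowest <= int(value_text) <= highest


literal = "3000000000"  # arbitrary example above the 32-bit maximum
fits_int32 = fits(literal, INT32_MIN, INT32_MAX)
fits_int64 = fits(literal, INT64_MIN, INT64_MAX)
```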
…::object refs

Static py::object destructors were calling Py_DECREF after Python interpreter
finalization, causing a segfault in __run_exit_handlers. Replace assignment
with .release() (sets internal PyObject* to nullptr without Py_DECREF) and
register cleanup via atexit from Python side so it runs while the interpreter
is still alive.
Delete ASTConstructor.py, ASTConstructorModules/, _rule_constants.py,
60+ unused visit_* pybind11 bindings, ~190 token/rule constants, and
the backwards-compat ASTVisitor import. Only ML_COMMENT constant kept
(used by ASTComment.py for prettify). Total: -5,942 lines.
@javihern98
Contributor Author

Closing this PR because the performance difference between building the AST in C++ and in Python turned out to be negligible. We will not pursue this.

Final results:
[image of the final results, not preserved]

@javihern98 javihern98 closed this Mar 12, 2026
@javihern98 javihern98 deleted the perf/cpp-parser-ast branch March 12, 2026 12:29